Measuring Cluster Stability for Bayesian Nonparametrics Using the Linear Bootstrap
Clustering procedures typically estimate which data points are clustered
together, a quantity of primary importance in many analyses. Often used as a
preliminary step for dimensionality reduction or to facilitate interpretation,
finding robust and stable clusters is often crucial for appropriate
downstream analysis. In the present work, we consider Bayesian nonparametric
(BNP) models, a particularly popular set of Bayesian models for clustering due
to their flexibility. Because of their complexity, the posteriors of BNP models
often cannot be computed exactly, and approximations must be employed. Mean-field
variational Bayes (MFVB) forms a posterior approximation by solving an optimization
problem and is widely used due to its speed. An exact BNP posterior might vary
dramatically when presented with different data. As such, stability and
robustness of the clustering should be assessed.
A popular means of assessing stability is to apply the bootstrap: resample
the data and rerun the clustering for each simulated data set. This is often
prohibitively expensive, especially for the sort of exploratory analysis
where clustering is typically used. We propose a fast and automatic
approximation to the full bootstrap called the "linear bootstrap", which can
be seen as a local perturbation of the data. In this work, we demonstrate how to apply this
idea to a data analysis pipeline, consisting of an MFVB approximation to a BNP
clustering posterior of time course gene expression data. We show that using
auto-differentiation tools, the necessary calculations can be done
automatically, and that the linear bootstrap is a fast but approximate
alternative to the bootstrap.

Comment: 9 pages, NIPS 2017 Advances in Approximate Bayesian Inference
Workshop
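To make the linear bootstrap concrete, here is a minimal numpy sketch (illustrative only, not the authors' MFVB pipeline; the model and all names are hypothetical). A weighted maximum-likelihood fit of an exponential rate stands in for an expensive clustering refit; the derivative of the optimum with respect to the data weights (closed form here, obtained by automatic differentiation in the paper's setting) replaces re-solving for each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.5, size=200)  # toy data, true rate = 2
n = len(x)

def solve(w):
    """Weighted MLE of an exponential rate; stands in for an expensive refit."""
    return np.sum(w) / np.sum(w * x)

theta_hat = solve(np.ones(n))
# derivative of the optimum with respect to each data weight, at w = 1
# (closed form for this toy model; in general, via automatic differentiation)
dtheta_dw = (theta_hat / n) * (1.0 - theta_hat * x)

B = 500
full, linear = [], []
for _ in range(B):
    w = rng.multinomial(n, np.ones(n) / n).astype(float)  # bootstrap weights
    full.append(solve(w))                                 # full bootstrap: refit
    linear.append(theta_hat + dtheta_dw @ (w - 1.0))      # linear bootstrap

full_sd, linear_sd = np.std(full), np.std(linear)
```

Each linear-bootstrap replicate costs only a dot product rather than an optimization, yet its spread tracks the full bootstrap's closely.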
Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box
Automatic differentiation variational inference (ADVI) offers fast and
easy-to-use posterior approximation in multiple modern probabilistic
programming languages. However, its stochastic optimizer lacks clear
convergence criteria and requires tuning parameters. Moreover, ADVI inherits
the poor posterior uncertainty estimates of mean-field variational Bayes
(MFVB). We introduce "deterministic ADVI" (DADVI) to address these issues.
DADVI replaces the intractable MFVB objective with a fixed Monte Carlo
approximation, a technique known in the stochastic optimization literature as
the "sample average approximation" (SAA). By optimizing an approximate but
deterministic objective, DADVI can use off-the-shelf second-order optimization,
and, unlike standard mean-field ADVI, is amenable to more accurate posterior
covariances via linear response (LR). In contrast to existing worst-case
theory, we show that, on certain classes of common statistical problems, DADVI
and the SAA can perform well with relatively few samples even in very high
dimensions, though we also show that such favorable results cannot extend to
variational approximations that are too expressive relative to mean-field ADVI.
We show on a variety of real-world problems that DADVI reliably finds good
solutions with default settings (unlike ADVI) and, together with LR
covariances, is typically faster and more accurate than standard ADVI.

Comment: 38 pages
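The SAA idea can be illustrated on a toy problem (a minimal sketch, not the paper's implementation): fitting a mean-field Gaussian to a one-dimensional Gaussian posterior, with the base draws fixed once so the objective is deterministic. Plain gradient descent, standing in for an off-the-shelf second-order optimizer, then converges without the decaying step sizes or convergence heuristics that stochastic-gradient ADVI needs.

```python
import numpy as np

rng = np.random.default_rng(0)
m, v = 1.0, 4.0           # target posterior: Normal(m, v)
z = rng.normal(size=100)  # base draws, fixed once -- this is the SAA step

def neg_elbo_saa(mu, log_s):
    """Deterministic (fixed-draw) negative ELBO for q = Normal(mu, exp(log_s)^2)."""
    theta = mu + np.exp(log_s) * z
    return 0.5 * np.mean((theta - m) ** 2) / v - log_s

# deterministic objective => ordinary gradient descent with a fixed step size
mu, log_s = 0.0, 0.0
for _ in range(2000):
    s = np.exp(log_s)
    theta = mu + s * z
    g_mu = np.mean(theta - m) / v
    g_ls = np.mean((theta - m) * s * z) / v - 1.0
    mu, log_s = mu - 0.2 * g_mu, log_s - 0.2 * g_ls
```

With these 100 fixed draws the optimum lands close to the exact answer (mean 1, standard deviation 2), with a small, purely deterministic Monte Carlo bias.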
An Automatic Finite-Sample Robustness Metric: When Can Dropping a Little Data Make a Big Difference?
We propose a method to assess the sensitivity of econometric analyses to the
removal of a small fraction of the data. Manually checking the influence of all
possible small subsets is computationally infeasible, so we provide an
approximation to find the most influential subset. Our metric, the "Approximate
Maximum Influence Perturbation," is automatically computable for common methods
including (but not limited to) OLS, IV, MLE, GMM, and variational Bayes. We
provide finite-sample error bounds on approximation performance. At minimal
extra cost, we provide an exact finite-sample lower bound on sensitivity. We
find that sensitivity is driven by a signal-to-noise ratio in the inference
problem, is not reflected in standard errors, does not disappear
asymptotically, and is not due to misspecification. While some empirical
applications are robust, results of several economics papers can be overturned
by removing less than 1% of the sample.

Comment: 47 pages. Submitted to Econometric
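A minimal sketch of the idea for OLS (hypothetical data; the metric itself covers a much broader class of estimators): a first-order approximation says dropping observation i shifts a coefficient by roughly minus its empirical influence, so the approximately most influential subset of size k is simply the k observations with the largest influence scores in the chosen direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# empirical influence of each observation: dropping observation i changes
# beta by approximately -infl[:, i]
infl = XtX_inv @ (X * resid[:, None]).T
slope_infl = infl[1]

# to lower the slope the most, drop the k observations with the largest
# positive influence on it; the predicted change is just a sum
k = 5
drop = np.argsort(-slope_infl)[:k]
approx_slope = beta[1] - slope_infl[drop].sum()

# exact refit without those observations, to check the approximation
keep = np.setdiff1d(np.arange(n), drop)
exact_slope = (np.linalg.inv(X[keep].T @ X[keep]) @ X[keep].T @ y[keep])[1]
```

Scanning all size-k subsets is combinatorial; the sorted-influence shortcut reduces the search to a single pass over n scores.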
Linear Response Methods for Accurate Covariance Estimates from Mean Field Variational Bayes
Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its fast runtime on large-scale data sets. However, a well-known failing of MFVB is that it underestimates the uncertainty of model variables (sometimes severely) and provides no information about model variable covariance. We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables, both for individual variables and coherently across variables. We call our method linear response variational Bayes (LRVB). When the MFVB posterior approximation is in the exponential family, LRVB has a simple, analytic form, even for non-conjugate models. Indeed, we make no assumptions about the form of the true posterior. We demonstrate the accuracy and scalability of our method on a range of models for both simulated and real data.
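The flavor of the correction can be seen in a two-variable sketch (an illustrative Gaussian special case, not the general derivation): for a bivariate Gaussian target with precision matrix Lambda, MFVB returns the correct means but diagonal variances 1/Lambda_ii, which understate the true marginals. A linear response correction of the form Sigma = (I - V H)^{-1} V, with H taken here as the off-diagonal coupling of the log-posterior Hessian, recovers the exact covariance.

```python
import numpy as np

# target posterior: bivariate Gaussian with correlation rho
rho = 0.8
Sigma_true = np.array([[1.0, rho], [rho, 1.0]])
Lam = np.linalg.inv(Sigma_true)  # precision matrix

# MFVB covariance: diagonal with variances 1/Lam_ii -- these underestimate
# the true marginal variances (here 0.36 vs 1.0)
V = np.diag(1.0 / np.diag(Lam))

# linear response correction Sigma_lr = (I - V H)^{-1} V, with H the
# off-diagonal coupling of the log-posterior Hessian (Gaussian special
# case, used for illustration)
H = -(Lam - np.diag(np.diag(Lam)))
Sigma_lr = np.linalg.solve(np.eye(2) - V @ H, V)
```

Here Sigma_lr equals Sigma_true exactly, including the off-diagonal covariance that plain MFVB sets to zero.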